Add thunder benchmarks #3394

Open · wants to merge 10 commits into main from pm/add_thunder_bench

Conversation

@Priya2698 Priya2698 commented Nov 12, 2024

Adds Thunder as an additional executor to the baseline benchmarks, using the corresponding thunder.jit function.
The following benchmarks do not have a Thunder variant:

  1. instancenorm: Unsupported operator in Thunder
  2. test_gelu_backward_reduction.py: .backward call is not supported within Thunder definitions. @IvanYashchuk has suggested using explicit backward computation for this case.

Issue #2718
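
For context, a minimal sketch of the executor dispatch this PR relies on. The helper name with_executor comes from later in this thread; the exact signature and body here are illustrative, not the code in this PR:

import thunder
import torch

def with_executor(fwd_fn, executor: str):
    # Illustrative dispatch only; the real helper lives in the benchmark utilities.
    if executor == "eager":
        return fwd_fn
    if executor == "torchcompile":
        return torch.compile(fwd_fn)
    if executor == "thunder":
        return thunder.jit(fwd_fn)
    raise ValueError(f"Unknown executor: {executor}")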

@Priya2698 (Collaborator Author)

!build

@Priya2698 Priya2698 marked this pull request as ready for review November 12, 2024 17:11
@liqiangxl (Collaborator)

!test --pybench

@liqiangxl (Collaborator)

Can you check what the benchmark results look like? Make sure there are no unexpected results.

@xwang233 (Collaborator) commented Nov 14, 2024

!test --pybench-full --dev

See results here when the pipeline finishes.

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from ce46a99 to ec697f0 Compare November 20, 2024 18:13
@Priya2698 Priya2698 marked this pull request as draft November 20, 2024 18:38
@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from ec697f0 to 2d9b5d3 Compare November 20, 2024 18:41
@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from 2d9b5d3 to 15a9f50 Compare December 9, 2024 20:46
@naoyam naoyam mentioned this pull request Dec 10, 2024
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from d69811f to ff1373f Compare December 10, 2024 19:20
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

1 similar comment
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@@ -325,6 +327,9 @@ def run_benchmark(
     def setup():
         clear_l2_cache()
         if device == "cuda":
+            for inp in inputs:
+                if isinstance(inp, torch.Tensor):
+                    inp.grad = None
Collaborator:

Thank you for this one. But this only covers the cases where an input requires gradients. Are we also clearing gradients on parameters?

@Priya2698 (Collaborator Author) Dec 16, 2024

Do you mean, for instance, weights in layernorm? Then, yes.

Collaborator:

I'm curious how this works in code.

If benchmark_fn is not a function but a torch module, the Thunder program doesn't expect the parameters to be among its inputs; I think they are stored in the Thunder-compiled callable. So I'm not sure how that's handled.

i.e., something like this:

import torch

# Per the point above, the module's parameters presumably live inside the
# thunder-compiled callable, not in the `inputs` list handed to run_benchmark.
foo = torch.nn.Linear(4, 5).cuda()
inp = torch.randn(8, 4, device="cuda")
benchmark_fn = with_executor(foo, "thunder")
# ...
run_benchmark(benchmark, unary_bwd_torch, [output, grad], ...)

Collaborator Author:

Ahh, you're right.
Even with clearing the gradients of the weights, bias, and inputs in the backward pass, I think there are still some variables/internal states that need to be reset.
The simplest way would be to run only one round for backward, but I feel that may be noisy, so I have been trying to make it work for multiple runs.

Collaborator:

Gotcha. No worries. I'm not totally clear on what Thunder's protocol is for ownership of parameters; I think it's supposed to be a functional compilation.
So we can still expect that with_executor has a chance to extract parameters from the nn.Module if one is given for a benchmark, and we should be able to identify the parameters that need zero_grad, just like an optimizer would.
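
A minimal sketch of that idea, assuming the benchmark setup still has a handle on the original nn.Module (clear_grads and maybe_module are hypothetical names):

import torch

def clear_grads(maybe_module) -> None:
    # Clear parameter gradients the way optimizer.zero_grad(set_to_none=True)
    # does, so grads do not accumulate across benchmark rounds.
    if isinstance(maybe_module, torch.nn.Module):
        for p in maybe_module.parameters():
            p.grad = None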

Collaborator:

BTW, this could also contribute to a potential performance diff.

If there are parameters requiring grad, Thunder will generate a backward graph and save intermediates, regardless of whether backward is called or not.
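
A sketch of one way to avoid that for forward-only measurements (not code from this PR; the helper name is made up):

import torch

def freeze_params(module: torch.nn.Module) -> torch.nn.Module:
    # With no parameters requiring grad, a compiler (thunder.jit,
    # torch.compile, ...) has no reason to build a backward graph or keep
    # intermediates alive for it in a forward-only benchmark.
    for p in module.parameters():
        p.requires_grad_(False)
    return module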

@Priya2698 (Collaborator Author)

!test --pybench-full --dev

1 similar comment
@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 marked this pull request as ready for review December 19, 2024 00:50
@Priya2698 (Collaborator Author)

For some operators, we still see performance differences between Thunder and the manual nvFuser definitions, which may be due to slight differences in the generated fusion definitions. I will look into this operator by operator.

I have compared the measured timings against nsys timings for some operators to verify the accuracy of the benchmark infra. My recommendation is to check in these changes adding the Thunder benchmarks while we investigate the performance gap.

@jjsjann123 (Collaborator) left a comment

I think there are still quite a few open questions about next steps and how to automate things like the IOBytes computation.

But for this PR, things look pretty mechanical, so I'm good with merging it as-is and ironing out the remaining issues afterwards.

My only concern is: would the disruption render our benchmarks unreliable until we fix all these issues, and would that cause problems for folks looking at the benchmarks?

@@ -115,6 +121,6 @@ def test_softmax_bwd_baseline_benchmark(
     run_benchmark(
         benchmark,
         unary_bwd_torch,
-        [outputs, grads],
+        [outputs, grads, *fwd_inputs],
Collaborator:

Sneaky! I see unary_bwd_torch discards the extra inputs, but you pass them here so their grads get cleared?
We could add a comment in unary_bwd_torch explaining why we don't assert on the number of inputs.
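
Something like this sketch, based on the unary_bwd_torch shown in the reviewer guide below (only the comment is new):

from typing import List

def unary_bwd_torch(inputs: List):  # [output, grad_out, *fwd_inputs]
    # Only the first two entries are used here. Any further entries
    # (e.g. fwd_inputs) are accepted solely so run_benchmark's setup() can
    # reset their .grad between rounds, hence no assertion on len(inputs).
    inputs[0].backward(inputs[1], retain_graph=True)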

@Priya2698 (Collaborator Author)

> My only concern is: would the disruption render our benchmarks unreliable until we fix all these issues, and would that cause problems for folks looking at the benchmarks?

The thunder.jit performance numbers match nsys. They also compare correctly against the manual nvfuser definitions for operators like scale-bias-relu, softmax, and silu-mul. But for others, like the normalizations, the performance is different. So my next step is to actually compare the fusion definitions and work out how each difference between the two fusion definitions corresponds to the performance gap (for instance, the manual fusion definitions are missing some downcasts for intermediate ops, the ordering of ops differs, etc.).

So by performance gap, I mean the latter case. The benchmark measurements themselves should be accurate.
This PR actually fixes one of the measurement issues that existed in the backward benchmarks due to grad accumulation.
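
As a toy, eager-mode illustration of what a missing intermediate downcast means for memory traffic (plain PyTorch, not the actual fusion definitions):

import torch

x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)

# Variant A: the intermediate stays in fp32 until the end, so the scale
# kernel reads and writes fp32 tensors.
y_a = (x.float().softmax(dim=-1) * 2.0).to(torch.bfloat16)

# Variant B: downcast right after the softmax, so the scale kernel touches
# bf16 tensors instead, roughly half the memory traffic for that op.
y_b = x.float().softmax(dim=-1).to(torch.bfloat16) * 2.0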

@Priya2698 (Collaborator Author)

> I think there are still quite a few open questions about next steps and how to automate things like the IOBytes computation.

Agreed! Let me open an issue tracking the IOBytes case. I will create an issue for the performance differences between thunder.jit and the manual nvfuser definitions shortly (I want to do some initial analysis first).

@Priya2698 (Collaborator Author)

!build

@Priya2698 Priya2698 force-pushed the pm/add_thunder_bench branch from 84bb09c to b7e0b3e Compare January 16, 2025 02:29
github-actions bot commented Jan 16, 2025

PR Reviewer Guide 🔍

(Review updated until commit 083d89d)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Changed Function Signature

The unary_bwd_torch function signature has been changed to accept additional arguments. This change may affect the functionality of the benchmarks.

def unary_bwd_torch(inputs: List):  # [output, grad_out, fwd_inputs]
    inputs[0].backward(inputs[1], retain_graph=True)
Added New Executor

A new executor named "thunder" has been added to the DEFAULT_EXECUTORS list. This change may affect the behavior of the benchmarks.

DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]
Modified Benchmark Function

The setup step of run_benchmark has been modified to clear input gradients (in addition to the existing L2 cache clearing) before each round. This change may affect the performance measurements.

def setup():
    clear_l2_cache()
    if device == "cuda":
        for inp in inputs:
            if isinstance(inp, torch.Tensor):
                inp.grad = None
        return [inputs], {}

@Priya2698 (Collaborator Author)

!test --pybench-full --dev

@Priya2698 Priya2698 requested a review from jjsjann123 January 16, 2025 22:38
@Priya2698 (Collaborator Author) commented Jan 16, 2025

@jacobhinkle @liqiangxl pinging for review.

We continue to see performance differences between Thunder-nvFuser and nvFuser. The Thunder-nvFuser executor will not run by default, so we can continue investigating this in parallel. This PR has the fix for the bwd grad accumulation issue, so we can merge it. If the consensus is to hold off on adding the Thunder-nvFuser benchmarks altogether, I can remove it from the executors list for the benchmarks as well; that change is minimal.

Latest benchmark run: http://nv/euP

  1. nvFuser performance remains the same.
  2. Eager and torch.compile remain the same for fwd and are faster for bwd.
  3. Thunder-nvFuser and nvFuser performance differs in some cases, mostly in the bwd pass. See issue "Performance gap between manual nvfuser definition and thunder.jit" #3629 for an example.

@jjsjann123 Can you give me the numbers you expect for RoPE bwd after the grad accumulation issue is resolved? I can then verify the numbers with this PR.

@@ -23,6 +22,8 @@
L2_CACHE_SIZE = DEVICE_PROPERTIES["gpu_l2_bytes"]
PEAK_BANDWIDTH_GBPS = DEVICE_PROPERTIES["gpu_peak_bandwidth_gbps"]

DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]
Collaborator Author:

Maybe this should be named differently, since these are not run nightly; for most benchmarks, though, this is the set of executors we execute weekly. We also have thunder-torchcompile for RoPE.
Maybe BASELINE_EXECUTORS is better, although Thunder is not really a baseline.

@Priya2698 (Collaborator Author)

!build
